Abstract

We consider a fundamental problem in unsupervised learning: given a collection of m 
points in $\mathbb{R}^n$, if many but not necessarily all of these points are contained in a d-dimensional
subspace T, can we find it? The points contained in T are called inliers and the remaining
points are outliers. This problem has received considerable attention in computer science 
and in statistics. Yet efficient algorithms from computer science are not robust to adversarial 
outliers, and the estimators from robust statistics are hard to compute in high dimensions. 
This is a serious and persistent issue not just in this application, but for many other problems 
in unsupervised learning. 

Are there algorithms for linear regression that are both robust to outliers and efficient? 
We give an algorithm that finds T when it contains more than a $d/n$ fraction of the points.
Hence, for say d = n/2 this estimator is both easy to compute and well-behaved when there 
are a constant fraction of outliers. We prove that it is small set expansion hard to find T 
when the fraction of errors is any larger, thus giving evidence that our estimator is an optimal 
compromise between efficiency and robustness. In fact, this basic problem has a surprising 
number of connections to other areas including small set expansion, matroid theory and 
functional analysis that we make use of here. 
1 Introduction



Unsupervised learning refers to the problem of trying to find hidden structure in unlabeled 
data. A ubiquitous approach is to model this hidden structure as a low-dimensional subspace 
that contains many of the data points. This approach has found a range of applications in areas 
such as feature selection, dimensionality reduction, spectral clustering, topic modeling and 
statistical inference. There are two important desiderata for an unsupervised learning algorithm, 
computational efficiency and robustness: computational efficiency refers to the goal of giving 
provable guarantees on the running time of the algorithm and robustness refers to the goal of 
giving guarantees that the algorithm produces a useful output even if the assumptions of the 
model do not hold exactly. Our focus in this paper is on understanding whether or not these 
two goals can be met simultaneously. 

Individually, these goals can each be met. For example, there are many known fast algorithms 
to compute the singular value decomposition, and from this decomposition it is straightforward 
to find a low-dimensional subspace that contains all of the data if it exists. There are also a 
number of provably robust estimators for linear regression. One famous example is Rousseeuw's 
least median of squares estimator [37]. The computational problem that underlies this estimator 
is to find a subspace that minimizes the median Euclidean distance to the data points. An 
adversary must corrupt at least half of the data points in order to corrupt the output. Many 
more robust estimators have been developed for this specific problem (e.g. least trimmed 
squares, M-estimators, the Theil-Sen estimator, reweighted least squares) and for other inference
problems by the robust statistics community (see e.g. [38] and [24]). 

Unfortunately, the singular value decomposition is not robust to outliers. Moreover, only 
modest improvements over brute-force search are known to actually compute the least median 
of squares estimator in high dimensions [17]. Is there an estimator for linear regression that is 
both efficiently computable and robust to outliers? This is an instance of a fundamental and 
largely unexplored question: 

"Can we reconcile computational efficiency and robustness in unsupervised learning?" 

Our focus here is on a challenging notion of robustness used in the robust statistics community:
an estimator is robust if an adversary can corrupt an $\alpha$ fraction of the data, and the
output of the estimator is still well-behaved. The fraction of data that an adversary is allowed
to corrupt is called the breakdown point [15]. We remark that there has been interesting recent
work on finding a subspace that approximately minimizes the sum of $\ell_p$ distances (for p > 2)
to the data points [14], [23]. Unfortunately, $\ell_p$-regression can be corrupted quite easily by an
adversary.

In general, the robust statistics community studies the breakdown properties of particular 
estimators. Here, our goal is not to study a particular estimator, but rather whether or not 
there is any robust estimator for linear regression that is also easy to compute. The following 
definition is central to our paper: 

Definition 1.1. An estimator $\mathcal{E}$ is an $\alpha$-robust estimator for the d-dimensional linear regression
problem in $\mathbb{R}^n$ if for any set of points in which a $1 - \alpha$ fraction are contained in a d-dimensional
linear subspace $T \subset \mathbb{R}^n$, the estimator returns T.

Here the breakdown point is $\alpha$. So the natural question is, for what choices of the parameters
n, d and $\alpha$ is there such an estimator that is also easy to compute? There are compelling reasons
to choose robust estimators over their classical counterparts, but so far their potential has not
been realized because there are no good algorithms to compute them. 

1.1 Complexity of Robust Linear Regression 

We assume that the points outside T are in general position, and that the points inside T are in 
general position with respect to T.¹ Recall that the dimension of T is d. Throughout this paper,
we will use L to denote the points inside T and we will call these the inliers, and the remaining
points outliers. Our first result is a simple randomized algorithm that achieves a breakdown
point of $\alpha = 1 - d/n$. Our result relies on Condition 2.1: any set of n points is linearly independent
if and only if at most d of the points are inliers. 

Theorem 1.2. If a set of m points in $\mathbb{R}^n$ has strictly more than $\frac{d}{n} m$ inliers and meets Condition 2.1,
then there is a Las Vegas algorithm whose output is the set L of inliers, each iteration can be implemented
in polynomial time and the expected number of iterations is $O(n^2 m)$.

In fact, an interesting comparison can be drawn between our algorithm and the famous RANSAC 
method of Fischler and Bolles [19]: Both approaches repeatedly select a random set of n points; 
RANSAC works when this sample contains only inliers, whereas our algorithm works when 
the sample contains at least d + 1 inliers. The main observation is that even if a set of n points 
contains many outliers, only the inliers can participate in a linear dependence. In fact, for 
d = n/2 and even if inliers make up 3/4 of the points, RANSAC will take an exponential number 
of iterations to find T while our algorithm requires only a constant number of iterations (see 
Remark 2.3). 
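
To make this comparison concrete, the following Python sketch evaluates both success probabilities exactly as hypergeometric tails. The parameter values (n = 20, d = 10, m = 200, 150 inliers) are illustrative assumptions and do not come from the analysis above; the helper name prob_at_least is likewise hypothetical.

```python
from math import comb

def prob_at_least(m, inliers, n, k):
    """P[a uniformly random n-subset of m points contains >= k inliers]."""
    total = comb(m, n)
    favorable = sum(comb(inliers, j) * comb(m - inliers, n - j)
                    for j in range(k, n + 1))
    return favorable / total

# Illustrative (assumed) parameters: d = n/2 and inliers are 3/4 of the points.
n, d, m, inliers = 20, 10, 200, 150

p_dependent = prob_at_least(m, inliers, n, d + 1)  # sample has >= d + 1 inliers
p_all_inliers = prob_at_least(m, inliers, n, n)    # sample has only inliers

print("P[at least d + 1 inliers]:", p_dependent)
print("P[only inliers]          :", p_all_inliers)
```

The first probability stays bounded away from zero as n grows with d = n/2, while the second decays roughly like (3/4)^n.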

Our algorithm can also be made stable in that the inliers do not need to be exactly contained 
within T. Here we need Condition 2.5: the smallest determinant of any set of points with at 
most d inliers is strictly larger than the largest determinant of any set of points with at least 
d + 1 inliers. 

Theorem 1.3. If a set of m points in $\mathbb{R}^n$ has strictly more than $\frac{d}{n} m$ inliers and meets Condition 2.5,
then there is a Las Vegas algorithm whose output is the set L of inliers, each iteration can be implemented
in polynomial time and the expected number of iterations is $O(n^2 m)$.

Our estimator achieves a constant breakdown point for, say, d = n/2. Yet there are numerous 
inefficient estimators that achieve a better breakdown point (e.g. a constant breakdown point 
even when d = n - 1). We provide evidence that our estimator is the optimal compromise between
efficiency and robustness: it is small set expansion hard to improve the breakdown point beyond 
this threshold. We state our result informally here: 

Theorem 1.4. There is an efficient reduction from an instance of $(\epsilon, \delta)$-Gap-Small-Set Expansion
on a graph G to Gap-Inlier such that:

• if there is a small non-expanding cut in G then there exists a subspace of dimension d containing
at least a $(1 - \epsilon)\frac{d}{n}$ fraction of the points,

• and if there is no small non-expanding cut then every subspace of dimension d contains at most
a $2\epsilon\frac{d}{n}$ fraction of the points.

¹ If we remove these conditions then when dim(T) = n - 1 the problem is equivalent to trying to satisfy as many
equations as possible in an overdetermined linear system. See [22], [28] and references therein. However, these
reductions produce instances which are quite far from ones we might expect to observe in real data, and one approach
for circumventing these hardness results is to instead require the above condition, which is satisfied almost surely in
most natural probabilistic models and seems to make the problem computationally much easier.

Khachiyan [27] proved a related result that it is NP-hard to find a d = n - 1 dimensional
subspace that contains a $(1 - \epsilon)\frac{d}{n}$ fraction of the data points.² In general, it seems difficult to
base hardness for robust linear regression (when d < n - 1) on standard assumptions and this is 
an interesting open question. 

Taking a step back, computational complexity is an important lens for understanding 
learning and statistical problems in the sense that there are many sample-efficient estimators, 
e.g., maximum likelihood, that are hard to compute, but by allowing more samples than the 
information theoretic minimum we can find alternatives that are easy to compute. Yet these 
hard estimators are still favored in practice, perhaps not just due to their sample efficiency 
but also due to their robustness. A broader goal of our paper is to bring to light questions 
about whether there are estimators that meet all three objectives of being efficiently computable, 
sample efficient and robust (and not just two out of three). 

1.2 Derandomization and Duality for Robust Linear Regression 

The crucial step in our randomized algorithm is to repeatedly sample subsets of n points and 
once we find one that is linearly dependent, we can use this subset to recover the set of inliers. 
If a collection of m points in $\mathbb{R}^n$ has the property that a random subset of n points is linearly
dependent (with non-negligible probability), can we find such a subset deterministically? We
give a solution to this problem using tools from matroid theory [18], [11], [21].

Indeed, a well-studied polytope in matroid literature is the basis polytope which is the convex 
hull of all sets of n points that form a basis (see Section 4). Condition 2.1 guarantees us that 
the vector $\frac{n}{m}\vec{1}$ is outside the basis polytope, and our goal of finding a set of n points that do not
span $\mathbb{R}^n$ can be stated equivalently as finding a Boolean vector (whose coordinates sum to n)
that is also outside the basis polytope. 

There has been a vast literature on the basis polytope and on submodular minimization, 
and there are deterministic strongly polynomial time algorithms for deciding membership in 
the basis polytope [18], [11], [21], [39], [26]. Our idea is that in each step we find a line segment $\ell$
that contains the current vector (starting with $\frac{n}{m}\vec{1}$). Since the current vector is outside the basis
polytope it is easy to see that at least one of the endpoints of $\ell$ must also be outside. So we can
move the current vector to this endpoint and if we choose these segments $\ell$ in an appropriate
way we will quickly find a Boolean solution. The key is that a membership oracle for the basis
polytope tells us which endpoint of $\ell$ we should move to. Hence we obtain an algorithm that is
not only an optimal tradeoff between efficiency and robustness, but is even deterministic: 

Theorem 1.5. If a set of m points in $\mathbb{R}^n$ has strictly more than $\frac{d}{n} m$ inliers and meets Condition 2.1,
then there is a deterministic polynomial time algorithm whose output is the set L of inliers.
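
The segment-walking idea behind this theorem can be sketched in Python as follows. Two ingredients here are assumptions made only for illustration: the membership oracle is a brute-force check over all subsets (exponential time, standing in for the polynomial-time oracles cited above), and each segment is chosen by shifting mass between two fractional coordinates, which need not be the choice of segments made later in the paper. The function names are hypothetical.

```python
from itertools import combinations
import numpy as np

def outside_basis_polytope(A, x, tol=1e-9):
    """Brute-force oracle: True if some subset U has sum_{i in U} x_i > rank(A_U).

    Assumes x >= 0 and sum(x) = n already hold, so only rank constraints matter.
    """
    m = A.shape[1]
    for k in range(1, m + 1):
        for U in combinations(range(m), k):
            cols = list(U)
            if sum(x[i] for i in cols) > np.linalg.matrix_rank(A[:, cols]) + tol:
                return True
    return False

def round_to_dependent_set(A, tol=1e-9):
    """Walk from (n/m) * 1 to a Boolean vector that stays outside the polytope."""
    n, m = A.shape
    x = np.full(m, n / m)
    if not outside_basis_polytope(A, x, tol):
        raise ValueError("starting vector is inside the basis polytope")
    while True:
        frac = [i for i in range(m) if tol < x[i] < 1 - tol]
        if len(frac) < 2:
            break
        i, j = frac[0], frac[1]
        # Endpoints of the segment through x in direction e_i - e_j.
        up = min(1 - x[i], x[j])
        down = min(x[i], 1 - x[j])
        a = x.copy(); a[i] += up; a[j] -= up
        b = x.copy(); b[i] -= down; b[j] += down
        # x lies between a and b, so at least one endpoint is outside as well.
        x = a if outside_basis_polytope(A, a, tol) else b
    return [i for i in range(m) if x[i] > 0.5]

# Toy instance in R^2: three collinear inliers (d = 1) and one outlier.
A = np.array([[1.0, 2.0, 3.0, 0.0],
              [0.0, 0.0, 0.0, 1.0]])
dependent_set = round_to_dependent_set(A)
print(dependent_set, np.linalg.matrix_rank(A[:, dependent_set]))
```

Every step fixes at least one more coordinate to 0 or 1 and preserves the coordinate sum, so the walk ends after at most m steps with the indicator vector of n columns that do not form a basis.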

The basis polytope not only plays a central role in robust linear regression but is also 
closely related to a notion studied in functional analysis that we call radial isotropic position. In 
fact, Barthe [2] studied a convex programming problem whose optimal solution finds a linear 
transformation that places a set of points in radial isotropic position (see Section 6) if it exists.
The connection is that the optimal value to this convex program is finite (i.e. there is such a
transformation) if and only if the vector $\frac{n}{m}\vec{1}$ is inside the basis polytope.

² This follows by applying a padding argument to the knapsack instance before proceeding with the reduction in
[27].

Barthe's convex program provides a connection between radial isotropic position and robust 
linear regression: just as placing a set of points in isotropic position is a proof that the set of 
points is not contained in a low-dimensional subspace, so is placing a set of points in radial 
isotropic position a proof that there is no d-dimensional subspace that contains more than a $d/n$
fraction of the points (see Section 4). We give effective bounds on the region in which an optimal 
solution to the convex program is contained, and how strictly convex the function is and use 
this to give an efficient algorithm to compute radial isotropic position. 

Theorem 1.6 (informal). There is a deterministic polynomial time algorithm to compute a linear 
transformation R that places a set of points in radial isotropic position, if such a transformation exists. 

Notably this theorem shows that if there is no low-dimensional subspace that contains many of 
the points, we can deterministically compute a certificate that there is no such subspace. 

Radial isotropic position can also be thought of as a more stable analogue of isotropic 
position that is not sensitive to either the norms of the data points or to a constant fraction of 
adversarial outliers! Isotropic position has important applications both in algorithms and in 
exploratory data analysis, but is quite sensitive to even a small number of outliers (see e.g. [40]). 
Just as robust statistics asks for estimators that are well-behaved in the presence of outliers, 
we could ask for canonical forms (e.g. isotropic position, radial isotropic position) that are 
well-behaved in the presence of outliers. Perhaps radial isotropic position will be a preferable 
alternative in some existing applications where being robust is crucial. 

Somewhat surprisingly, this elementary problem of finding a low-dimensional subspace that 
contains many of the data points is connected to a number of problems and combinatorial objects 
including the small set expansion hypothesis, the independent set polytope and submodular 
minimization and notions in functional analysis, and we make use of all of these connections. 

1.3 Related Work 

Our work fits into a broader agenda within statistics and machine learning: Can we recover 
a low-rank matrix from noisy or incomplete observations? The foundational work of Recht, 
Fazel and Parrilo [36] and Candes and Recht [6] gave convex programming algorithms that 
provably recover a low-rank matrix when given a small number of randomly chosen entries in
the matrix. These techniques have since been adapted to settings in which an adversary can 
corrupt some of the entries in the matrix [9], [5], [41]. However we note that there are two 
incomparable models for how an adversary is allowed to corrupt the entries in a low-rank 
matrix, and which model is more natural depends on the setting. For example, the exciting 
work of Candes et al. [5] considers a model in which an adversary can corrupt a constant fraction
of the entries of A whose locations are chosen uniformly at random. In contrast, the model 
in [41], [42] for example allows an adversary to corrupt a large fraction of the columns of A. 
This is the setting in our work, and this assumption is most natural when we think of columns 
of A as representing individuals from a population and uncorrupted columns correspond to 
individuals that fit the model, but we would like to make as few assumptions as possible about 
the remaining individuals that do not fit the model. We note that much of the recent work 
from statistics and machine learning has focused on stochastic settings for this problem, where 
one posits a distributional model that generates both the inliers and outliers and the goal is to 
recover the subspace T with high probability. Yet, our deterministic condition (Condition 2.1) is 
usually satisfied in these probabilistic models. For example, the recent work of Lerman et al.
[30] considers a model in which inliers are chosen according to a standard Gaussian restricted
to T, and outliers are chosen according to a standard Gaussian on $\mathbb{R}^n$. Samples chosen from this
model satisfy Condition 2.1 with an exponentially small failure probability and we believe that 
our randomized algorithm is a simpler and more attractive alternative to convex programming 
approaches for these problems. 

The above discussion has focused on notions of robustness that allow an adversary to corrupt 
a constant fraction of the entries in the matrix A. However, this is only one possible definition 
of what it means for an estimator to be robust to noise. For example, principal component 
analysis can be seen as finding a d-dimensional subspace that minimizes the sum of squared 
distances to the data points. A number of works have proposed modifications to this objective 
function (along with approximation algorithms) in the hopes that this objective function is 
more robust. As an example, Deshpande et al. [14] gave an $O(p^{p/2})$ approximation algorithm for
the problem of finding a subspace that minimizes the sum of $\ell_p$ distances to the data points
(for p > 2). Another example is the recent work of Naor, Regev and Vidick [32] which gives a 
constant factor approximation for finding a d-dimensional subspace that maximizes the sum of 
Euclidean lengths of the projections of the data points (instead of the sum of squared lengths). 
Lastly, we mention that Dunagan and Vempala [16] gave a geometric definition of an outlier 
(that does not depend on a hidden subspace T) and give an optimal algorithm for removing 
outliers according to this definition. 

2 A Simple Randomized Algorithm 

Here we give a randomized algorithm for robust linear regression. The idea is that once we 
find any non-trivially sparse linear dependence we can use it to find the set of inliers provided 
that the inliers are in general position with respect to T. The breakdown point of this estimator 
is exactly the threshold at which a random set of n points is linearly dependent with non- 
negligible probability. Surprisingly, in Section 3 we give evidence based on the small set 
expansion conjecture that there is no efficient estimator that has a better breakdown point. 

We will think of an instance of robust linear regression as a matrix $A \in \mathbb{R}^{n \times m}$ with $m \ge n$ and
rank n. Throughout this paper, for $V \subset [m]$ we will let $A_V$ denote the submatrix corresponding
to columns in V. Suppose that there is a d-dimensional subspace T that contains strictly
more than a $d/n$ fraction of the columns of A. Our goal is to recover this subspace (under mild
general position conditions on these points) efficiently. Let $L \subset [m]$ be the columns of A that are
inliers. We will need the following condition, which is almost surely satisfied by any reasonable
probabilistic model that generates inliers from the subspace T and outliers from all of $\mathbb{R}^n$:

Condition 2.1. A set of n columns of A is linearly independent if and only if at most d of the columns
are inliers.

The next lemma gives a lower bound on the probability that a random sample contains at least
d + 1 inliers:

Lemma 2.2. Suppose that we are given a set of m points in $\mathbb{R}^n$ with strictly more than $\frac{d}{n} m$ inliers. Let
V be a uniformly random set of n points (without repetition). Then the probability that V contains at
least d + 1 inliers is at least $\frac{1}{2n^2 m}$.
Algorithm 1 RandomizedFind, Input: $A \in \mathbb{R}^{n \times m}$ which satisfies Condition 2.1

1. Set U = [m]

2. start: Choose $V \subset U$ with $|V| = n$ uniformly at random

3. If $\mathrm{rank}(A_V) < n$,

4. Find a non-zero $u \in \ker(A_V)$, Set $\mathcal{C} = \mathrm{span}(\{A_i : u_i \neq 0\})$, Set $L = \{i : A_i \in \mathcal{C}\}$

5. Output L

6. Else

7. Return to start
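
As a rough numerical illustration of Algorithm 1, the sketch below implements the sampling loop in Python with NumPy. The rank test, the kernel vector and the span membership test are all carried out through singular value decompositions with a relative tolerance `tol`; that tolerance, the iteration cap, the function name and the synthetic instance are assumptions of this sketch, since the algorithm itself is stated in exact arithmetic.

```python
import numpy as np

def randomized_find(A, rng, tol=1e-8, max_iters=100_000):
    """Las Vegas search for the inlier columns of A (floating-point sketch)."""
    n, m = A.shape
    for _ in range(max_iters):
        V = rng.choice(m, size=n, replace=False)   # random n-subset of columns
        _, s, Vt = np.linalg.svd(A[:, V])
        if s[-1] > tol * s[0]:
            continue                               # rank(A_V) = n: resample
        u = Vt[-1]                                 # A_V @ u is numerically zero
        support = V[np.abs(u) > tol]               # columns with u_i != 0
        # Orthonormal basis Q for the span of the supporting columns.
        Q, sv, _ = np.linalg.svd(A[:, support], full_matrices=False)
        Q = Q[:, sv > tol * sv[0]]
        # A column is reported as an inlier if it has no component outside span(Q).
        residual = np.linalg.norm(A - Q @ (Q.T @ A), axis=0)
        return np.flatnonzero(residual <= tol * np.linalg.norm(A, axis=0))
    raise RuntimeError("no linearly dependent sample found")

# Synthetic instance: 24 inliers in a 3-dimensional subspace of R^6, 16 outliers.
rng = np.random.default_rng(0)
n, d, m, num_inliers = 6, 3, 40, 24
basis = rng.standard_normal((n, d))
A = np.hstack([basis @ rng.standard_normal((d, num_inliers)),
               rng.standard_normal((n, m - num_inliers))])
found = randomized_find(A, rng)
print(found)
```

In this instance the inliers occupy the first 24 columns and make up more than a d/n = 1/2 fraction of the points, so a random sample of n = 6 columns contains at least d + 1 = 4 inliers with constant probability.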



Proof: Let X be a random variable defined to be the number of inliers in a random set V of n
points. Then $E[X] > d$ and set $\bar{X} = X - E[X]$. Then let p be the probability that $\bar{X} \ge 0$, and this
condition certainly implies that we have at least d + 1 inliers. Since the expectation of $\bar{X}$ is zero,
we have that

$$p\,E[\bar{X} \mid \bar{X} \ge 0] + (1 - p)\,E[\bar{X} \mid \bar{X} < 0] = 0.$$

Then we can upper bound $E[\bar{X} \mid \bar{X} \ge 0] \le n - d$ and $-p\,E[\bar{X} \mid \bar{X} < 0] \le pn$. Hence

$$p(n - d) + pn \ge -E[\bar{X} \mid \bar{X} < 0] \ge \frac{|L|}{m} - \frac{d}{n} \ge \frac{1}{nm}$$

and this completes the proof of the lemma. ■

Remark 2.3. We remark that the lower bound on p can be improved to $p \ge (d/n)^2/2$ when $m \ge 6n + 2$
and $n \ge 3$. Hence our algorithm is quite practical in this range of parameters.

Indeed, with the same notation as above, condition X on the event $\mathcal{E}$ that the first two
samples are contained in L (the set of inliers). Clearly, $P\{\mathcal{E}\} \ge (d/n)^2$. On the other hand, we
still sample n - 2 points without replacement. Each sample now has a probability of landing
in L that is at least $q \ge \frac{dm/n - 2}{m - 2} \ge \frac{d}{n} - \frac{1}{3n}$. Here, we used that $m \ge 6n + 2$. Hence,
$E[X \mid \mathcal{E}] \ge (n - 2)\left(\frac{d}{n} - \frac{1}{3n}\right) \ge d - 1$, where we used that $2/n + 1/3 \le 1$. On the other hand, we have
$P\{X \ge \lfloor E[X \mid \mathcal{E}] \rfloor \mid \mathcal{E}\} \ge \frac{1}{2}$ by the "mean is median" theorem for hypergeometric distributions
(see, e.g., [25]). It follows that $P\{X \ge d + 1\} \ge P\{X \ge d - 1 \mid \mathcal{E}\}\,P\{\mathcal{E}\} \ge \frac{1}{2}(d/n)^2$.

The next claim captures the intuition that from any non-trivially sparse linear dependence 
in A it is easy to compute the set of inliers: 

Claim 2.4. If $|V| = n$, then any non-zero vector in the kernel of $A_V$ must contain d + 1 inliers in its support
and no outliers.

Proof: Suppose there is a vector $u \in \ker(A_V)$ with $u_i \neq 0$ for some $i \notin L$. Since any d + 1 inliers are
linearly dependent, we can use Caratheodory's Theorem (see [31]) to find a vector $v \in \ker(A_V)$
supported on at most d inliers and for which $v_i \neq 0$. This contradicts Condition 2.1 since the
support of any non-zero vector in the kernel must contain at least d + 1 inliers (otherwise
we can extend the support to a basis). Furthermore, any non-zero vector in the kernel of $A_V$ must be
supported on at least d + 1 inliers. ■
Algorithm 2 RandomizedFind2, Input: $A \in \mathbb{R}^{n \times m}$ which satisfies Condition 2.5

1. Set U = [m]

2. start: Choose $V \subset U$ with $|V| = n$ uniformly at random

3. If $\det(A_V^T A_V) < C^2$

4. While $|V| > d + 1$

5. Find $u \in V$ such that $\det(A_{V - \{u\}}^T A_{V - \{u\}}) < C^2$

6. Set $V = V - \{u\}$

7. Set $L = V \cup \{v \mid \det(A_{V \triangle \{u,v\}}^T A_{V \triangle \{u,v\}}) < C^2\}$ where $u \in V$

8. Else

9. Return to start



We can now prove Theorem 1.2: 

Proof: Claim 2.4 guarantees the correctness of the algorithm, and Lemma 2.2 guarantees that
the success probability of each iteration is at least $p \ge \frac{1}{2n^2 m}$, and this implies the theorem. ■

We are interested in generalizing our algorithm to the setting where inliers are only approximately
contained in the subspace. This idea is formalized next.

Condition 2.5. Any set V of at most n columns of A has $\det(A_V^T A_V) \ge C^2$ if the number of inliers is
at most d, and otherwise strictly less than $C^2$.

We can now prove Theorem 1.3, which is stable even when the inliers are not exactly 
contained in a subspace T: 

Proof: Lemma 2.2 guarantees that the probability that the algorithm finds a set V with $|V| = n$
and $\det(A_V^T A_V) < C^2$ is at least $p \ge \frac{1}{2n^2 m}$, and furthermore the algorithm maintains the invariant
that the set V always has at least d + 1 inliers, and at the end of the while loop V is a set of d + 1
inliers. Then Condition 2.5 guarantees that the algorithm correctly outputs the set of inliers. ■
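
The determinant-based variant can be sketched along the same lines. In the Python code below, the threshold `C2` (playing the role of $C^2$), the noise level of 1e-9 and the synthetic instance are assumptions chosen so that Condition 2.5 plausibly holds; the sketch does not verify the condition and the function names are hypothetical.

```python
import numpy as np

def gram_det(A, cols):
    """det(A_V^T A_V) for the columns of A indexed by cols."""
    B = A[:, cols]
    return np.linalg.det(B.T @ B)

def randomized_find_stable(A, d, C2, rng, max_iters=100_000):
    """Sketch of the stable variant; C2 stands for the threshold C^2."""
    n, m = A.shape
    for _ in range(max_iters):
        V = [int(i) for i in rng.choice(m, size=n, replace=False)]
        if gram_det(A, V) >= C2:
            continue                       # sample has at most d inliers: resample
        while len(V) > d + 1:
            # Drop a column whose removal keeps the determinant below C^2.
            for u in V:
                rest = [w for w in V if w != u]
                if gram_det(A, rest) < C2:
                    V = rest
                    break
            else:
                raise RuntimeError("Condition 2.5 appears to fail on this input")
        base = V[1:]                       # V with one element u = V[0] swapped out
        extra = {v for v in range(m)
                 if v not in V and gram_det(A, base + [v]) < C2}
        return sorted(set(V) | extra)
    raise RuntimeError("no sample with small determinant found")

# Synthetic instance: 8 inliers within 1e-9 of a plane in R^4, plus 4 outliers.
rng = np.random.default_rng(1)
n, d, m, num_inliers = 4, 2, 12, 8
basis, _ = np.linalg.qr(rng.standard_normal((n, d)))
near_T = basis @ rng.standard_normal((d, num_inliers))
near_T += 1e-9 * rng.standard_normal((n, num_inliers))
A = np.hstack([near_T, rng.standard_normal((n, m - num_inliers))])
found_stable = randomized_find_stable(A, d, C2=1e-9, rng=rng)
print(found_stable)
```

Here any set containing d + 1 = 3 inliers has a Gram determinant many orders of magnitude below the threshold, while sets with at most d inliers have determinants of ordinary size, which is the separation that Condition 2.5 asks for.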



3 Computational Limits 

We will now present evidence that the robust linear regression problem is computationally
hard beyond the breakdown point achieved by our randomized algorithm in Section 2. For
this purpose we need to introduce the expansion profile of a graph. Given a $\Delta$-regular graph
G = (V, E) we define the edge expansion of a set $S \subset V$ as

$$\phi_G(S) = \frac{|E_G(S, V \setminus S)|}{\Delta |S|}.$$

Here and in the following, we let $E_G(A, B)$ denote the set of edges in G with one endpoint in A
and the other in B. Let us also denote $\mu(S) = |S|/|V|$. Given a parameter $\delta \in [0, 1/2]$, we define the
expansion profile of G as the curve

$$\phi_G(\delta) = \min_{\mu(S) = \delta} \phi_G(S).$$
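
For small graphs these two definitions can be evaluated directly. The Python sketch below does so by brute force over all vertex sets of a given size; it is exponential in the number of vertices and meant only to make the definitions concrete. The function names and the 6-cycle example are assumptions of the sketch.

```python
from itertools import combinations

def edge_expansion(edges, degree, S):
    """phi_G(S): edges leaving S divided by degree * |S| in a regular graph."""
    S = set(S)
    cut = sum(1 for (a, b) in edges if (a in S) != (b in S))
    return cut / (degree * len(S))

def expansion_profile(edges, num_vertices, degree, size):
    """Minimum of phi_G(S) over all vertex sets S with exactly `size` vertices."""
    return min(edge_expansion(edges, degree, S)
               for S in combinations(range(num_vertices), size))

# Example: the 6-cycle is 2-regular; sets of 3 vertices have measure 1/2.
cycle = [(i, (i + 1) % 6) for i in range(6)]
phi_half = expansion_profile(cycle, 6, 2, 3)
print(phi_half)
```

On the 6-cycle the minimum is attained by three consecutive vertices, which are cut off by two edges, while three alternating vertices have all six incident edges leaving the set.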
With these definitions we describe the Small Set Expansion problem as was recently studied
by [34, 35]: 

Definition 3.1. The Gap-Small-Set Expansion problem is defined as: Given a graph G, and
constants $\epsilon, \delta > 0$, distinguish the two cases

1. $\phi_G(\delta) \ge 1 - \epsilon$,

2. $\phi_G(\delta) \le \epsilon$.

We will relate the previous problem to the Gap-Inlier problem that we define next. 

Definition 3.2. The Gap-Inlier problem is defined as: Given m points $u_1, \dots, u_m \in \mathbb{R}^n$, and
constants $\epsilon, \delta$, distinguish the two cases

1. there exists a subspace of dimension $\delta n$ containing a $(1 - \epsilon)\delta$ fraction of the points,

2. every subspace of dimension $\delta n$ contains at most an $\epsilon\delta$ fraction of the points.

Our next theorem shows a reduction from Gap-Small-Set Expansion to Gap-Inlier.

Theorem 3.3. Let $\epsilon, \delta > 0$. There is an efficient reduction which, given a $\Delta$-regular graph G = (V, E),
produces an instance $u_1, \dots, u_m \in \mathbb{R}^n$ of Gap-Inlier such that

Completeness: If $\phi_G(\delta) \le \epsilon$, then there exists a subspace of dimension $\delta n$ containing at least a $(1 - \epsilon)\delta$
fraction of the points.

Soundness: If $\phi_G(\delta') \ge 1 - \epsilon$ for every $\delta' \in [2\delta/\Delta, 2\delta]$, then every subspace of dimension $\delta n$ contains
at most a $2\epsilon\delta$ fraction of the points.

Proof: Our reduction works as follows. Let G = (V, E) be an instance of Gap-Small-Set Expansion.
Let m = |E| and n = |V|. For each edge e = (i, j) create a vector $u_e = \alpha_e e_i + \beta_e e_j$, where $e_i$
is the i-th standard basis vector and $\alpha_e, \beta_e$ are drawn independently and uniformly at random
from [0, 1]. This defines an instance $u_1, \dots, u_m \in \mathbb{R}^n$ of Gap-Inlier.

To analyze our reduction, it will be helpful to consider the following intermediate graph.
Let B = (E, V) be the bipartite graph where we connect each edge $e \in E$ with the two vertices
in V that it is incident to. Note that B is $(2, \Delta)$-regular. The next claim relates the dimension of a
set of points to the size of the neighborhood of the corresponding edge set in the graph B.

Claim 3.4. For every set of points $P \subseteq \{u_1, \dots, u_m\}$ corresponding to a set of edges $F \subset E$ we have with
probability 1 over the choice of the coefficients above

$$\dim(\mathrm{span}(P)) = |E_B(F, V)|.$$

Proof: On the one hand, the points P are contained in the coordinate subspace of dimension $d =
|E_B(F, V)|$ corresponding to the union of the support of the vectors. On the other hand, we
claim that they also span this coordinate subspace. Fix any set of d points touching
all d coordinates. It is not difficult to show that these points are linearly independent with
probability 1 over the randomness in the coefficients.³ ■

³ Note that without perturbation an even cycle, for example, causes a linear dependence.
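
The construction of the points from the graph, and the role of the random coefficients, can be illustrated with a short Python sketch. The function name, the option of switching the perturbation off by passing `rng=None`, and the 4-cycle example are assumptions made for illustration.

```python
import numpy as np

def gap_inlier_instance(edges, num_vertices, rng=None):
    """Matrix whose columns are u_e = alpha_e * e_i + beta_e * e_j for e = (i, j).

    With rng=None every coefficient is 1 (no perturbation); otherwise the
    coefficients are drawn independently and uniformly from [0, 1).
    """
    U = np.zeros((num_vertices, len(edges)))
    for col, (i, j) in enumerate(edges):
        alpha, beta = (1.0, 1.0) if rng is None else rng.uniform(0.0, 1.0, size=2)
        U[i, col] = alpha
        U[j, col] = beta
    return U

# An even cycle: four edges touching four coordinates.
four_cycle = [(0, 1), (1, 2), (2, 3), (3, 0)]
rank_plain = np.linalg.matrix_rank(gap_inlier_instance(four_cycle, 4))
rank_random = np.linalg.matrix_rank(
    gap_inlier_instance(four_cycle, 4, np.random.default_rng(0)))
print(rank_plain, rank_random)
```

With all coefficients equal to 1 the alternating sum of the four edge vectors vanishes, so the rank drops below 4; with generic coefficients the determinant is, up to sign, the product of the alpha coefficients minus the product of the beta coefficients, which is non-zero with probability 1.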
Completeness. We begin with the completeness claim. Let $S \subset V$ be a set of measure $\delta$ and
suppose that $\phi_G(S) \le \epsilon$. Double-counting the edges spanned by S, we get

$$\Delta|S| = 2|E_G(S, S)| + |E_G(S, V \setminus S)|.$$

Hence, $|E_G(S, S)| \ge \Delta|S|/2 - \epsilon\Delta|S|/2 = (1 - \epsilon)\Delta|S|/2$. On the other hand the edge set $E_G(S, S)$ has at
most |S| neighbors in B. This implies that the points corresponding to $E_G(S, S)$ are contained in a
coordinate subspace of dimension |S|. Equivalently, there exists a $(\delta n)$-dimensional subspace
containing at least $(1 - \epsilon)\delta\Delta n/2$ points. Since $m = \Delta n/2$, this corresponds to a fraction of $(1 - \epsilon)\delta$,
which is what we wanted to show.



Soundness. Next we establish soundness. Consider any set of points P contained in $\delta n$
dimensions. We will show that under the given assumption on the expansion profile of G,
it follows that $|P| \le \epsilon\Delta\delta n$. Again, since $m = \Delta n/2$, this directly implies that any subspace
of dimension $\delta n$ contains at most a $2\epsilon\delta$ fraction of the points. Let F be the set of edges
corresponding to P in the graph B and let S be its vertex neighborhood in B. By Claim 3.4, it
is sufficient to show that $|F| \le \epsilon\Delta\delta n$. First, note that the neighbor set $S \subset V$ of F in the graph B
satisfies

$$\frac{2\delta n}{\Delta} \le |S| \le 2\delta n. \qquad (1)$$

The second inequality follows from the fact that each edge e has exactly two neighbors and we
have equality if the edges form a matching. The first inequality follows because G is $\Delta$-regular.
Thus, a set of |S| vertices can induce at most $\Delta|S|/2$ edges and all edges in F are induced by S.
Counting the edges touching S as before,

$$\Delta|S| = 2|E_G(S, S)| + |E_G(S, V \setminus S)| \ge 2|E_G(S, S)| + \Delta(1 - \epsilon)|S|.$$

The inequality follows from our assumption on the expansion profile of G, which we may apply
because S satisfies Equation 1. Consequently:

$$|E_G(S, S)| \le \frac{\epsilon\Delta|S|}{2}.$$

On the other hand, $|F| \le |E_G(S, S)|$, since every edge in F is induced by S. Hence, the previous
inequality together with Equation 1 shows that $|F| \le \epsilon\Delta|S|/2 \le \epsilon\Delta\delta n$. ■



4 The Basis Polytope 

Here we connect the independent set polytope which has received considerable attention in 
matroid literature, to a notion studied in functional analysis that we call radial isotropic position. 
In Section 5 we will use known algorithms for deciding membership in the independent set 
polytope to derandomize our algorithm from Section 2. And in Section 7 we will give an efficient 
algorithm to compute radial isotropic position, which can be thought of as a robust analogue to 
isotropic position. Let $A = [u_1, \dots, u_m] \in \mathbb{R}^{n \times m}$ with $m \ge n$.

Definition 4.1. Let $P_A$ be the independent set polytope defined as:

$$P_A = \mathrm{conv}\left\{ 1_U : U \subset [m],\ \dim(\mathrm{span}\{u_i : i \in U\}) = |U| \right\},$$

where $1_U$ is the m-dimensional indicator vector of the set U. Also let $K_A$ be the basis polytope
which is the facet of $P_A$ corresponding to $\sum_{i=1}^m x_i = n$.
These polytopes can be defined (in a more general context) using the language of matroid
theory where independent sets of vectors are replaced by independent sets in a matroid. A 
fundamental algorithmic problem in matroid theory is to give an efficient membership oracle 
for these polytopes. A number of solutions are known which all follow from a characterization 
of Edmonds [18] that reduces membership to solving a submodular minimization problem: 

$$\min_{U \subset [m]} \ \mathrm{rank}(\{u_i : i \in U\}) - \sum_{i \in U} x_i$$

The optimum value of this minimization is nonnegative if and only if $x \in P_A$ [18]. Hence
the known algorithms for submodular minimization [21], [39], [26],
and even a direct algorithm of Cunningham [11], immediately yield:

Theorem 4.2. There is a deterministic polynomial time algorithm to solve the membership problem
for the independent set polytope $P_A$ (and the basis polytope $K_A$).
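
For very small instances, membership can also be checked by evaluating the minimization over every subset U directly. The Python sketch below does this; it takes exponential time and is only a stand-in for the polynomial-time algorithms of Theorem 4.2. The function names, the tolerance and the explicit nonnegativity check on x are assumptions of the sketch.

```python
from itertools import combinations
import numpy as np

def in_independent_set_polytope(A, x, tol=1e-9):
    """Brute force: x >= 0 and rank(A_U) - sum_{i in U} x_i >= 0 for every U."""
    m = A.shape[1]
    if any(xi < -tol for xi in x):
        return False
    for k in range(1, m + 1):
        for U in combinations(range(m), k):
            cols = list(U)
            slack = np.linalg.matrix_rank(A[:, cols]) - sum(x[i] for i in cols)
            if slack < -tol:
                return False
    return True

def in_basis_polytope(A, x, tol=1e-9):
    """Membership in the facet of the independent set polytope with sum(x) = n."""
    n = A.shape[0]
    return abs(sum(x) - n) <= tol and in_independent_set_polytope(A, x, tol)

# Three collinear points and one outlier in R^2 (d = 1, n = 2, m = 4).
A_bad = np.array([[1.0, 2.0, 3.0, 0.0],
                  [0.0, 0.0, 0.0, 1.0]])
# Four points in general position in R^2.
A_good = np.array([[1.0, 0.0, 1.0, 1.0],
                   [0.0, 1.0, 1.0, 2.0]])
uniform = [2 / 4] * 4                       # the vector (n/m) * 1
print(in_basis_polytope(A_bad, uniform), in_basis_polytope(A_good, uniform))
```

In the first instance the three collinear columns have rank 1 but total weight 3/2, so the uniform vector violates a constraint; in the second instance every subset satisfies its constraint.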

We will use this tool from matroid theory to derandomize our algorithm from Section 2. 
Recall that the main step in our algorithm is to repeatedly sample subsets of n points and once 
we find one that is linearly dependent, we can use this subset to recover the set of inliers. So our 
approach is to use a membership oracle for the basis polytope to find a subset of n points that is 
linearly dependent deterministically. 

The basis polytope not only plays a central role in robust linear regression but also in a 
notion studied in functional analysis called radial isotropic position. These two concepts can 
be thought of as dual to each other: Recall that the set of vectors $u_1, \dots, u_m \in \mathbb{R}^n$ is in isotropic
position if

$$\sum_{i=1}^m u_i \otimes u_i = \mathrm{Id}_n.$$

It is well-known that a set of points can be placed in isotropic position if and only if the points 
are not all contained in an $(n-1)$-dimensional subspace. Just as isotropic position can be thought
of as a certificate that a set of points is full-dimensional, so too radial isotropic position can be 
thought of as a certificate that there is no low-dimensional subspace that contains many of the 
points. 
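
As an illustration of the isotropic case, the sketch below computes a transformation that places full-dimensional points in isotropic position by taking $R = M^{-1/2}$ with $M = \sum_i u_i \otimes u_i$. This choice of R, the function name `isotropic_transform`, and the numerical threshold are assumptions for the example; any R with $R M R^* = \mathrm{Id}_n$ would serve equally well.

```python
import numpy as np


def isotropic_transform(A):
    """Return R = M^{-1/2} where M = sum_i u_i u_i^T and u_i are the columns of A.

    Raises ValueError when the columns do not span R^n, in which case
    no linear transformation can make them isotropic.
    """
    M = A @ A.T
    eigenvalues, eigenvectors = np.linalg.eigh(M)
    if eigenvalues.min() <= 1e-12 * max(eigenvalues.max(), 1.0):
        raise ValueError("points lie in a proper subspace")
    return eigenvectors @ np.diag(eigenvalues ** -0.5) @ eigenvectors.T


A = np.array([[1.0, 0.0, 1.0, 2.0],
              [0.0, 1.0, 1.0, -1.0]])
R = isotropic_transform(A)
B = R @ A  # transformed points R u_i as columns
print(np.allclose(B @ B.T, np.eye(2)))  # True
```

The failure branch mirrors the statement above: when all points lie in an $(n-1)$-dimensional subspace, M is singular and no such R exists.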

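Definition 4.3. We say that a linear transformation $R : \mathbb{R}^n \to \mathbb{R}^n$ puts a set of vectors $u_1, \ldots, u_m \in
\mathbb{R}^n$ in radial isotropic position with respect to a coefficient vector $c \in \mathbb{R}^m$ if

\[ \sum_{i=1}^{m} c_i \, \frac{Ru_i}{\|Ru_i\|} \otimes \frac{Ru_i}{\|Ru_i\|} = \mathrm{Id}_n \]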

If a set of vectors meets Condition 2.1 then it cannot be put in radial isotropic position: any 
linear transformation A preserves the invariant that the inliers lie in a subspace of dimension d, 
but after applying A and rescaling the points to be unit vectors the variance of a random sample 
restricted to this subspace is strictly larger than d, which is too large! More generally, when can 
a set of vectors be put in radial isotropic position? Barthe [2] gave a complete answer to this 
question: 

Theorem 4.4 (Barthe). A set of vectors $u_1, \ldots, u_m \in \mathbb{R}^n$ can be put in radial isotropic position with
respect to $c \in \mathbb{R}^m$ if and only if $c \in K_A$. Moreover, $c \in K_A$ if and only if the following supremum has
finite value:

\[ \sup_{t_1, \ldots, t_m \in \mathbb{R}} \ \langle c, t \rangle - \log\det\Big( \sum_{i=1}^{m} e^{t_i} u_i \otimes u_i \Big) \]

This concave maximization problem provides a connection between radial isotropic position 
and robust linear regression. The optimal value reveals to us which case we are in: if it is finite, 
then the points can be put in radial isotropic position but if it is infinite (under Condition 2.1) 
then there is a subspace T of dimension d that contains more than a $\frac{d}{n}$ fraction of the points!

5 A Deterministic Algorithm 

In this section we apply tools from matroid theory (see [18], [11]) to derandomize our algorithm 
from Section 2. Recall that the main step in our algorithm from Section 2 is to repeatedly sample 
subsets of n points and once we find one that is linearly dependent, we can use this subset to 
recover the set of inliers. Our goal is to find such a subset deterministically and we can think 
about this problem instead in terms of the basis polytope. 

Condition 2.1 guarantees that the vector $\frac{n}{m}\mathbf{1}$ is outside the basis polytope. We remark that
a set of n columns is linearly dependent if and only if its indicator vector is outside the basis
polytope. So we can think about this derandomization problem instead as a rounding problem:
we are given a vector $\frac{n}{m}\mathbf{1}$ that is outside the basis polytope and we would like to round it to a
Boolean vector (that sums to n) that is also outside the basis polytope.

Our approach is simple to describe, and builds on known polynomial time membership 
oracles for the basis polytope developed within combinatorial optimization [18], [11], [21], [39], 
[26]. In each step we find a line segment $\ell$ that contains the current vector (starting with $\frac{n}{m}\mathbf{1}$).
Since the basis polytope is convex and the current vector is outside it, at least one of the
endpoints of $\ell$ must also be outside. So we can move the current vector to this endpoint and if
we choose these segments $\ell$ in an appropriate way we will quickly find a Boolean solution.

Indeed, Edmonds gave a general characterization of the independent set polytope:

Theorem 5.1. [18] The independent set polytope $P_A$ can equivalently be described as:

\[ P_A = \Big\{ x \in \mathbb{R}^m : \text{for all } U \subseteq [m],\ \dim(\mathrm{span}\{u_i : i \in U\}) \ge \sum_{i \in U} x_i \Big\} \]

Hence we can intersect this alternative description of $P_A$ with the constraint $\sum_i x_i = n$ to
obtain an alternative description of the basis polytope that will be more convenient for our
purposes. Indeed, if Condition 2.1 is met then any subset U of points has rank equal to
$\min(n, \min(|U \cap L|, d) + |U \setminus L|)$ and so:

Corollary 5.2. If a set of $m \ge n$ points meets Condition 2.1, then

\[ K_A = \Big\{ x \in [0,1]^m : \sum_{i=1}^{m} x_i = n \ \text{ and } \ \sum_{i \in L} x_i \le d \Big\} \]
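
Under Condition 2.1, the rank formula $\min(n, \min(|U \cap L|, d) + |U \setminus L|)$ collapses the exponentially many constraints to three simple ones: each coordinate lies in [0, 1], the coordinates sum to n, and the total weight on the inliers is at most d. The sketch below encodes this closed-form test. The function name and tolerance are assumptions, and the inlier set L is passed in explicitly, which is only possible in an illustration since the algorithm itself does not know L.

```python
def in_basis_polytope_under_condition(x, inliers, n, d, tol=1e-9):
    """Closed-form membership test for the basis polytope, valid only when
    the points meet Condition 2.1.

    Checks x in [0,1]^m, sum_i x_i = n and sum_{i in L} x_i <= d.
    """
    in_box = all(-tol <= xi <= 1 + tol for xi in x)
    sums_to_n = abs(sum(x) - n) <= tol
    inlier_mass_ok = sum(x[i] for i in inliers) <= d + tol
    return in_box and sums_to_n and inlier_mass_ok


m, n, d = 6, 3, 1
uniform = [n / m] * m  # the vector (n/m) * 1
# 3 of 6 points are inliers: fraction 1/2 > d/n = 1/3, so (n/m) * 1 is outside.
print(in_basis_polytope_under_condition(uniform, {0, 1, 2}, n, d))  # False
# 2 of 6 points are inliers: fraction 1/3 is not more than d/n, so it is inside.
print(in_basis_polytope_under_condition(uniform, {0, 1}, n, d))     # True
```

This is the sense in which the uniform vector $\frac{n}{m}\mathbf{1}$ certifies a large inlier fraction: it violates the inlier constraint exactly when more than a $\frac{d}{n}$ fraction of the points are inliers.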

Algorithm 3 DerandomizedFind, Input: $A \in \mathbb{R}^{n \times m}$ which satisfies Condition 2.1

1. Set $U = [m]$
2. While $|U| > n$
3.     For each $i \in U$
4.         Check if $\frac{n}{|U|-1}\mathbf{1} \in K_{A_{U \setminus \{i\}}}$
5.         If 'NO', Set $U = U \setminus \{i\}$ (exit for loop)
6. Find a nonzero $u \in \ker(A_U)$, Set $T = \mathrm{span}(\{A_i : u_i \ne 0\})$, Return $L = \{i : A_i \in T\}$
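
The steps of DerandomizedFind can be sketched in Python as follows. This is a hedged illustration rather than the algorithm as analyzed: the membership oracle here is a brute-force check of the rank constraints (exponential in the number of points), standing in for the polynomial time oracles of Section 4, and the kernel vector is found numerically via an SVD. The helper names `uniform_in_basis_polytope` and `derandomized_find` and all tolerances are assumptions.

```python
from itertools import combinations

import numpy as np


def uniform_in_basis_polytope(A, V, n, tol=1e-9):
    """Brute-force oracle: is (n/|V|) * 1 in the basis polytope of A_V?

    Checks rank(A_W) >= (n/|V|) * |W| for every nonempty subset W of V.
    """
    weight = n / len(V)
    for size in range(1, len(V) + 1):
        for W in combinations(V, size):
            if np.linalg.matrix_rank(A[:, list(W)]) < weight * size - tol:
                return False
    return True


def derandomized_find(A):
    n, m = A.shape
    U = list(range(m))
    while len(U) > n:
        for i in U:
            V = [j for j in U if j != i]
            if not uniform_in_basis_polytope(A, V, n):  # oracle answers 'NO'
                U = V
                break
        else:
            raise ValueError("no removable point; is Condition 2.1 violated?")
    # A_U is a singular n x n matrix: the right singular vector belonging to
    # the smallest singular value spans its kernel.
    kernel = np.linalg.svd(A[:, U])[2][-1]
    support = [U[k] for k in range(n) if abs(kernel[k]) > 1e-9]
    T = A[:, support]  # columns spanning the recovered subspace
    dim_T = np.linalg.matrix_rank(T)
    return [i for i in range(m)
            if np.linalg.matrix_rank(np.column_stack([T, A[:, i]])) == dim_T]


# n = 3, d = 1: three inliers on the line through v, three generic outliers.
v = np.array([1.0, 2.0, 3.0])
A = np.column_stack([v, 2 * v, -v, np.eye(3)])
print(derandomized_find(A))  # [0, 1, 2]
```

On this instance the loop discards points 0, 3 and 4 in turn and stops with $U = \{1, 2, 5\}$, which contains $d + 1 = 2$ inliers; the kernel vector is supported on the two inliers, so their span is T.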



Lemma 5.3. After exiting the while loop, $|U \cap L| \ge d + 1$.

Proof: An immediate consequence of Corollary 5.2 is that for each call to the membership oracle
for $K_{A_V}$ for some set V, the answer is 'NO' if and only if the fraction of inliers in V is more than
$\frac{d}{n}$. Since at the start of the while loop we are guaranteed that the fraction of inliers in U is more
than $\frac{d}{n}$, this is an invariant of the algorithm. All that remains is to check that for any set U with
$|U| > n$ and more than a $\frac{d}{n}$ fraction of inliers, there is some element i that we can remove from U
to maintain this condition (i.e. the algorithm does not get stuck). This is easy to check since if U
contains even just one outlier, we can choose i to be that element and this will only increase
the fraction of inliers and if instead there are no outliers left then we can choose any inlier to
remove. Hence the algorithm does not get stuck, outputs a set U with $|U| = n$ which has strictly
more than a $\frac{d}{n}$ fraction of inliers and so $|U \cap L| \ge d + 1$. ■

Theorem 5.4. Given a set of m points $u_1, \ldots, u_m \in \mathbb{R}^n$ with $m > n$ that meets Condition 2.1 and in
which T contains more than a $\frac{d}{n}$ fraction of the points, DerandomizedFind computes T. The
running time of this algorithm is bounded by a fixed polynomial in n, m.

Proof: Since $|U \cap L| \ge d + 1$, we have that $\mathrm{rank}(A_U) < n$. Then using Claim 2.4, DerandomizedFind
computes the span T of the inliers, and outputs exactly the set of inliers. Note that there are a
number of known strongly polynomial time algorithms for deciding membership in $K_A$ (see
Section 4). ■

6 Barthe's Convex Program 

Recall that the basis polytope $K_A$ characterizes exactly when we can put a set of points in radial
isotropic position [2]. There are several known algorithms from the matroid literature that
provide a strongly polynomial time algorithm for deciding membership in $K_A$. However, the
focus of this section and the next is not just deciding if the optimization problem of Barthe has 
finite or infinite value, but finding an optimal solution in case that the optimum is finite. From 
the solution to this optimization problem, we will be able to derive the linear transformation that 
places a set of points in radial isotropic position. Here we will explain in detail the connection 
found by Barthe [2] and others [8, 7] between convex programming and radial isotropic position. 
In the next section we will prove various effective bounds on this convex programming problem 
that we need in order to show that the Ellipsoid method finds an optimal solution. 

Recall that Barthe considers maximizing a concave function (or equivalently minimizing a
convex function):

\[ \sup_{t \in \mathbb{R}^m} \ \langle c, t \rangle - \log\det\Big( \sum_{i=1}^{m} e^{t_i} u_i \otimes u_i \Big) \]

for a given set of points $u_1, \ldots, u_m \in \mathbb{R}^n$ and a coefficient vector $c \in \mathbb{R}^m$. How is this unconstrained
maximization problem related to the linear transformation that puts the points $u_1, \ldots, u_m$ into
radial isotropic position? For now we specialize our discussion to the case in which $c = \frac{n}{m}\mathbf{1}$,
where $\mathbf{1}$ is the all ones vector. Let $t_1, \ldots, t_m \in \mathbb{R}$. Consider the matrix $U = \sum_{i=1}^{m} e^{t_i} u_i \otimes u_i$. Provided
the $u_i$ span $\mathbb{R}^n$, this matrix is positive definite and has full rank. Therefore it has a symmetric positive
definite square root and we can define $R = U^{-1/2}$. Notice that

\[ \mathrm{Id}_n = U^{-1/2} U U^{-1/2} = \sum_{i=1}^{m} e^{t_i} Ru_i \otimes Ru_i \]

Hence, we have what we need if we can choose $t_i$ such that $e^{t_i} = \frac{n}{m}\|Ru_i\|^{-2}$. The crucial insight is
that these conditions are exactly the optimality conditions in Barthe's maximization problem.

Lemma 6.1 ([2]). Let $A = [u_1, \ldots, u_m]$ denote a matrix with column vectors $u_1, \ldots, u_m \in \mathbb{R}^n$. Suppose
$\varphi_A^*(c) < \infty$. Then, any optimal solution $t_1, \ldots, t_m$ to $\varphi_A^*(c)$ satisfies $c_i = \langle e^{t_i} u_i, (A e^T A^*)^{-1} u_i \rangle$ for every
$1 \le i \le m$.

For completeness, we present Barthe's proof and to simplify notation we will continue
specializing our discussion to $c = \frac{n}{m}\mathbf{1}$. Consider maximizing the function f over $\mathbb{R}^m$ defined as:

\[ f(t) = \frac{n}{m} \sum_{i=1}^{m} t_i - \log\det U, \qquad U = \sum_{i=1}^{m} e^{t_i} u_i \otimes u_i. \]

It is not hard to show that f is concave (a short proof is given in Lemma 7.4). What is crucial is
that if t maximizes f(t), then it must satisfy that the gradient of f at t vanishes. We can apply a
well-known formula for the derivative of logdet (see e.g. [29]) and:

\[ \frac{\partial f(t)}{\partial t_i} = \frac{n}{m} - \mathrm{Tr}\Big( U^{-1} \frac{\partial U}{\partial t_i} \Big). \]

Also in our case:

\[ \frac{\partial U}{\partial t_i} = \frac{\partial (A e^T A^*)}{\partial t_i} = \sum_{j=1}^{m} \frac{\partial (e^{t_j} u_j \otimes u_j)}{\partial t_i} = e^{t_i} u_i \otimes u_i. \]

And so we conclude that the optimality condition is for all $i \in [m]$:

\[ 0 = \frac{n}{m} - \mathrm{Tr}\big( U^{-1} e^{t_i} u_i \otimes u_i \big) = \frac{n}{m} - e^{t_i} \langle u_i, U^{-1} u_i \rangle, \]

where the last step uses the identity Tr(ABC) = Tr(BCA). Recall that $\langle u_i, U^{-1} u_i \rangle = \|Ru_i\|^2$ and so
any optimal $t \in \mathbb{R}^m$ satisfies $e^{t_i} = \frac{n}{m}\|Ru_i\|^{-2}$ which is precisely the condition we needed. And by
the concavity of f, the supremum of f(t) is attained if it is finite.
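
The optimality condition suggests a simple numerical experiment: run gradient ascent on $f(t) = \langle c, t \rangle - \log\det(\sum_i e^{t_i} u_i \otimes u_i)$ and read off $R = U^{-1/2}$ at the final iterate. The sketch below does this with plain gradient ascent and a fixed step size of $1/n$. That solver, the step size, the iteration cap and the function name are assumptions made for illustration; the analysis in this paper uses the Ellipsoid method instead, and the sketch comes with no running time guarantee.

```python
import numpy as np


def radial_isotropic_transform(A, c, steps=50000, tol=1e-12):
    """Gradient ascent on f(t) = <c, t> - log det(sum_i e^{t_i} u_i u_i^T).

    Returns R = U^{-1/2} evaluated at the final iterate t.
    """
    n, m = A.shape
    t = np.zeros(m)
    for _ in range(steps):
        U = (A * np.exp(t)) @ A.T  # sum_i e^{t_i} u_i u_i^T
        # leverage_i = e^{t_i} <u_i, U^{-1} u_i>, so grad f = c - leverage
        leverage = np.exp(t) * np.einsum("ij,ij->j", A, np.linalg.solve(U, A))
        grad = c - leverage
        if np.abs(grad).max() < tol:
            break
        t += grad / n
    U = (A * np.exp(t)) @ A.T
    eigenvalues, eigenvectors = np.linalg.eigh(U)
    return eigenvectors @ np.diag(eigenvalues ** -0.5) @ eigenvectors.T


# Four pairwise independent vectors in R^2 with c = (n/m) * 1 = (1/2) * 1.
A = np.array([[1.0, 0.0, 1.0, 1.0],
              [0.0, 1.0, 1.0, -2.0]])
c = np.full(4, 0.5)
R = radial_isotropic_transform(A, c)
B = R @ A
B = B / np.linalg.norm(B, axis=0)  # rescale each R u_i to a unit vector
print(np.round((B * c) @ B.T, 6))  # close to the 2 x 2 identity
```

At a stationary point the gradient vanishes, which is exactly the condition $c_i = e^{t_i}\|Ru_i\|^2$, so the weighted sum of the normalized outer products equals $R U R = \mathrm{Id}_n$.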

7 Computing Radial Isotropic Position Efficiently



Here we prove two important properties of the convex programming problem considered by 
Barthe, that we will need in order to prove that the Ellipsoid method can solve it. We prove 
that if the optimum is finite, there is a solution in a bounded region that is optimal. Also we 
establish a lower bound on how strictly convex the objective function is, since we will need this 
to show that any candidate solution that is close enough to achieving the optimum value must 
also be close to the optimum solution. 

7.1 Effective Bounds 

Here we prove bounds on the region in which an optimal solution can be found. Our proof
follows the same basic outline as in [3, 2, 8] but is self-contained. We define $\varphi_A : \mathbb{R}^m \to \mathbb{R}$ as

\[ \varphi_A(t_1, \ldots, t_m) = \log\det\Big( \sum_{i=1}^{m} e^{t_i} u_i \otimes u_i \Big). \]

Given $c \in \mathbb{R}^m$, consider the optimization problem $\varphi_A^*(c) = \sup_{t \in \mathbb{R}^m} \langle t, c \rangle - \varphi_A(t_1, \ldots, t_m)$. The
function $\varphi_A^*$ is the Legendre transform of $\varphi_A$. For convenience, we will write
$\log\det(\sum_{i=1}^{m} e^{t_i} u_i \otimes u_i) = \log\det(A e^T A^*)$ where T denotes the diagonal matrix with entries $t_1, \ldots, t_m$, $A^*$ is the
transpose of A, and $e^T$ denotes the matrix exponential of T (i.e. a diagonal matrix in which the
i-th entry on the diagonal is $e^{t_i}$). We also introduce the notation:

Definition 7.1. Let $d_I = \det(A_I A_I^*)$, $t_I = e^{\sum_{i \in I} t_i}$ and $D = \min_{I : d_I \ne 0} d_I$, where $A_I$ is the submatrix
whose columns are indexed by I.

When we use the subscript I without further specification, we will always mean a subset of 
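[m] of size n. We will make repeated use of the Cauchy-Binet formula in this section:

Fact 7.2. Let $A, B^* \in \mathbb{R}^{n \times m}$. Then $\det(AB) = \sum_{I : |I| = n} \det(A_I)\det(B_I)$, where $B_I$ denotes the
submatrix of B whose rows are indexed by I.

This generalizes the well-known identity that the determinant of the product of two matrices
is the product of the determinants. We can apply this formula:

Claim 7.3. $\det(\sum_{i=1}^{m} e^{t_i} u_i \otimes u_i) = \det(A e^T A^*) = \sum_{I \subseteq [m], |I| = n} t_I d_I$.

We can now show that the mapping $\varphi_A$ is convex, and hence the objective $\langle c, t \rangle - \varphi_A(t)$ defining
$\varphi_A^*(c)$ is concave in t (which we asserted in Section 6):

Lemma 7.4. The function $\varphi_A$ is convex on $\mathbb{R}^m$.

Proof: Let $s, t \in \mathbb{R}^m$ and define $s_I = e^{\sum_{i \in I} s_i}$ analogously to $t_I$. Then, applying the Cauchy-Schwarz inequality,

\[ \varphi_A\Big(\frac{s+t}{2}\Big) = \log\det(A e^{(S+T)/2} A^*) = \log\Big( \sum_{I} \sqrt{s_I d_I} \sqrt{t_I d_I} \Big) \le \log\Big( \sqrt{\sum_I s_I d_I}\ \sqrt{\sum_I t_I d_I} \Big) = \frac{\varphi_A(s) + \varphi_A(t)}{2} \]
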
where the second equality uses Claim 7.3. ■ 
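
Claim 7.3 and the midpoint inequality in the proof of Lemma 7.4 are easy to sanity-check numerically. The sketch below evaluates $\varphi_A$ once directly from the matrix $A e^T A^*$ and once through the Cauchy-Binet expansion $\sum_I t_I d_I$, then tests midpoint convexity at two sample points. The function names and the sample data are assumptions chosen for illustration.

```python
from itertools import combinations

import numpy as np


def logdet_direct(A, t):
    """phi_A(t) = log det(A e^T A^*), computed from the matrix itself."""
    return np.log(np.linalg.det((A * np.exp(t)) @ A.T))


def logdet_cauchy_binet(A, t):
    """phi_A(t) via Claim 7.3: log of the sum over n-subsets I of t_I * d_I,
    with t_I = exp(sum_{i in I} t_i) and d_I = det(A_I A_I^*) = det(A_I)^2."""
    n, m = A.shape
    total = 0.0
    for I in combinations(range(m), n):
        cols = list(I)
        total += np.exp(t[cols].sum()) * np.linalg.det(A[:, cols]) ** 2
    return np.log(total)


A = np.array([[1.0, 0.0, 1.0, 1.0],
              [0.0, 1.0, 1.0, -2.0]])
s = np.array([0.3, -1.0, 0.5, 0.0])
t = np.array([1.0, 0.2, -0.4, 0.7])

# The two evaluations agree up to floating point error.
print(np.isclose(logdet_direct(A, s), logdet_cauchy_binet(A, s)))  # True

# Midpoint convexity: phi((s + t)/2) <= (phi(s) + phi(t)) / 2.
midpoint = logdet_direct(A, (s + t) / 2)
average = (logdet_direct(A, s) + logdet_direct(A, t)) / 2
print(midpoint <= average)  # True
```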

Our main step is to show that if c is contained in $K_A$ with "sufficient slack", then the optimum
is finite and we get a bound on the norm of an optimizer. To state the condition we need, we
will use a slightly unconventional definition for how to dilate $K_A$. This simplifies our arguments
(in part because it preserves the "trivial" constraints that the coordinates sum to n and are each
in the interval [0, 1]):

Definition 7.5. Let $CK_A$ denote the vectors c whose coordinates sum to n and are each in the
interval [0, 1] and for all nonnegative directions u with $u_{\min} = \min_i u_i = 0$, $C \max_{v \in K_A} \langle u, v \rangle \ge \langle u, c \rangle$.

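Lemma 7.6. Let $\alpha > 0$ and suppose $c \in (1-\alpha)K_A$. Then

1. $\varphi_A^*(c) \le \log\frac{1}{D}$

2. there is a $t^*$ with $f(t^*) = \varphi_A^*(c)$ that satisfies $\|t^*\|_\infty \le \frac{1}{\alpha}\log\frac{\det(AA^*)}{D}$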

Proof: From the assumption that $c \in (1-\alpha)K_A$, it follows directly that $\sum_{i=1}^{m} c_i = n$ and that $c_i \in [0, 1]$.
Throughout this proof, let $f(t) = \langle c, t \rangle - \varphi_A(t)$. Let $t \in \mathbb{R}^m$. We need to upper bound f(t). Note
that we may assume that $\min_j t_j = 0$ by adding a constant $a \in \mathbb{R}$ to all coordinates without
changing the value of f(t) (since $\sum_i c_i = n$ and $\varphi_A(t + a\mathbf{1}) = \varphi_A(t) + na$). For notational
convenience, assume that the coordinates of t are
sorted in decreasing order $t_1 \ge t_2 \ge \ldots \ge t_m = 0$. This is without loss of generality since we can
always apply a permutation to the columns of A and the coordinates of t without changing the
function value.

Claim 7.7. $f(t) \le \log(\frac{1}{D}) - \alpha \max_j t_j$

Proof: Let $I^* \subseteq [m]$ be the set of the n pivotal vectors in $[u_1, \ldots, u_m]$, i.e., the indices of the vectors
that are not in the span of the vectors to the left of them.
By the monotonicity of the logarithm and Claim 7.3:

\[ \varphi_A(t_1, \ldots, t_m) = \log\Big(\sum_I t_I d_I\Big) \ge \log(t_{I^*} d_{I^*}) = \sum_{j \in I^*} t_j + \log(d_{I^*}) \ge \sum_{j \in I^*} t_j + \log\big(\min_{I : d_I \ne 0} d_I\big). \]

Furthermore, we claim that

\[ \sum_{j \in I^*} t_j - \sum_{j=1}^{m} c_j t_j \ge \alpha \max_j t_j. \qquad (2) \]

Together these two inequalities directly imply the statement of the claim. It therefore only
remains to prove (2). First note that $I^*$ maximizes $\langle \mathbf{1}_I, t \rangle = \sum_{j \in I} t_j$ among all I such that $d_I \ne 0$.
On the other hand, we know that $c \in (1-\alpha)K_A$. Hence $\langle c, t \rangle \le (1-\alpha)\langle \mathbf{1}_{I^*}, t \rangle$ and this implies

\[ \sum_{j \in I^*} t_j - \sum_{j=1}^{m} c_j t_j \ge \alpha \sum_{j \in I^*} t_j \ge \alpha t_1 \]

which establishes (2). ■
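
The pivotal set $I^*$ used in this proof is the output of a left-to-right greedy scan, and when the coordinates of t are sorted in decreasing order it attains the largest value of $\sum_{j \in I} t_j$ over all I with $d_I \ne 0$. The sketch below computes $I^*$ and compares it against a brute-force maximum on a tiny instance. The function names and the example matrix are assumptions for illustration.

```python
from itertools import combinations

import numpy as np


def pivotal_columns(A):
    """Indices of columns that are not in the span of the columns to their left."""
    chosen = []
    for j in range(A.shape[1]):
        if np.linalg.matrix_rank(A[:, chosen + [j]]) > len(chosen):
            chosen.append(j)
    return chosen


def best_basis_weight(A, t, tol=1e-9):
    """Brute force: max of sum_{j in I} t_j over n-subsets I with d_I != 0."""
    n, m = A.shape
    return max(sum(t[j] for j in I)
               for I in combinations(range(m), n)
               if abs(np.linalg.det(A[:, list(I)])) > tol)


# Column 1 is parallel to column 0, so it is not pivotal.
A = np.array([[1.0, 2.0, 0.0, 1.0],
              [0.0, 0.0, 1.0, 1.0]])
t = [3.0, 2.0, 1.0, 0.0]  # sorted in decreasing order, with minimum 0

I_star = pivotal_columns(A)
print(I_star)                                             # [0, 2]
print(sum(t[j] for j in I_star), best_basis_weight(A, t))  # 4.0 4.0
```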

The previous claim shows that as any $t_j$ tends to infinity, f(t) tends to $-\infty$. Hence $\varphi_A^*(c) < \infty$
and, by the convexity of $\varphi_A$, the supremum is attained meaning that we can find $t^*$ such that
$f(t^*) = \varphi_A^*(c)$. But $f(t^*) = \varphi_A^*(c) \ge f(0) = -\log\det(AA^*) = -\log(\sum_I d_I)$, where we used
Claim 7.3 in the last equality. Combining this inequality with Claim 7.7, we conclude

\[ \max_j t^*_j \le \frac{1}{\alpha} \log\Big( \frac{\det(AA^*)}{\min_{I : d_I \ne 0} d_I} \Big). \]

7.2 Strict Convexity

Here we prove that if a candidate solution t is close to achieving the optimal value then it is also
close to the optimal solution $t^*$. This is not a vacuous property since if a convex function f is
not strictly convex, being close to the optimal value for the objective function does not imply
that a solution is close to the optimal solution.

The catch is that our function f is not strictly convex on all of $\mathbb{R}^m$. If $t^a$ denotes the vector
obtained from t by adding the constant a to all coordinates in t, then for every a, $f(t^a) = f(t)$
(where here we use the condition that $\sum_j c_j = n$). Hence, there are points $t, t'$ at arbitrary distance
that satisfy $f(t) = f(t')$. However, we can show that this is the only scenario in which the function
is not strictly convex.

Definition 7.8. Let us say that $s, t \in \mathbb{R}^m$ are b-separated if $\|(s + a\mathbf{1}) - t\|_\infty > b$ for every $a \in \mathbb{R}$.
Here, $\mathbf{1}$ denotes the all ones vector.
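
Since the best shift a centers the coordinate differences $s_i - t_i$ around zero, the quantity $\min_a \|(s + a\mathbf{1}) - t\|_\infty$ equals half the spread $\max_i (s_i - t_i) - \min_i (s_i - t_i)$, so b-separation can be checked in linear time. The helper names below are assumptions for this small sketch.

```python
def separation(s, t):
    """min over a of ||(s + a*1) - t||_inf, which equals half the spread
    of the coordinate differences s_i - t_i."""
    diffs = [si - ti for si, ti in zip(s, t)]
    return (max(diffs) - min(diffs)) / 2


def is_b_separated(s, t, b):
    """True iff ||(s + a*1) - t||_inf > b for every real a."""
    return separation(s, t) > b


print(separation([0.0, 2.0, 4.0], [0.0, 0.0, 0.0]))           # 2.0
print(separation([0.0, 2.0, 4.0], [5.0, 5.0, 5.0]))           # 2.0 (shifting t changes nothing)
print(is_b_separated([3.0, 3.0, 3.0], [0.0, 0.0, 0.0], 0.0))  # False: s is a shift of t
```

The last line reflects the scenario described above: two points that differ by a multiple of the all ones vector are never b-separated for any $b \ge 0$, matching the direction along which f is constant.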

This definition leads to the next lemma. 

Lemma 7.9. Let $s, t \in \mathbb{R}^m$ be any two b-separated points for some $b > 0$. Assume all coordinates of s, t
are non-negative and that for every $i, j \in [m]$ there exists $S \subseteq [m]$ with $|S| = n - 1$ such that $d_{S \cup \{i\}} \ne 0$
and $d_{S \cup \{j\}} \ne 0$. Then,

\[ \varphi_A\Big(\frac{s+t}{2}\Big) \le \frac{\varphi_A(s) + \varphi_A(t)}{2} - \frac{b^2 D^2}{2\det(AA^*)^2} \exp\big(-(n+1)(\|s\|_\infty + \|t\|_\infty)\big) \]

We defer the proof of this lemma to Appendix A. With the previous lemma we will later
argue that whenever f(t) is very close to optimal, then t itself cannot be separated from an
optimal solution by much.

7.3 An Algorithm 

Our next theorem gives a polynomial time algorithm for computing the radial isotropic position.
The assumptions are slightly stronger than simply asking that $c \in K_A$.

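Theorem 7.10. Let $\varepsilon > 0$ and $\alpha > 0$. Let $A = [u_1, \ldots, u_m] \in \mathbb{R}^{n \times m}$ with $m > n$ and rank(A) = n.
Further assume that for every $i, j \in [m]$ there exists $S \subseteq [m]$ with $|S| = n - 1$ such that $d_{S \cup \{i\}} \ne 0$ and
$d_{S \cup \{j\}} \ne 0$. Then, given A and any point $c \in (1-\alpha)K_A$, we can compute an $n \times n$ matrix R such that

\[ \sum_{i=1}^{m} c_i \Big( \frac{Ru_i}{\|Ru_i\|} \otimes \frac{Ru_i}{\|Ru_i\|} \Big) = \mathrm{Id}_n + J \]

where $\|J\|_\infty \le \varepsilon$. The running time of our algorithm is polynomial in $1/\alpha$, $\log(1/\varepsilon)$ and L where L is
an upper bound on the bit complexity of the input A and c.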

Proof: We will apply the Ellipsoid method as described in [33] (Theorem 4.1.2) to solve the
optimization problem $\sup_{t \in \mathbb{R}^m} \langle c, t \rangle - \varphi_A(t)$ over the set of all $t \in \mathbb{R}^m$ satisfying $\|t\|_\infty \le B$ where B
is the bound on $\|t^*\|_\infty$ from Lemma 7.6 and $t_j \ge 0$ for all $j \in [m]$. Let s denote the solution computed
by the Ellipsoid method and suppose we have $|f(s) - f(t^*)| \le \delta^2$ where

\[ \delta \le \frac{\varepsilon' \cdot \min_I d_I}{e^{O(Bn)} \det(AA^*)}, \]

and $\varepsilon'$ is a sufficiently small quantity that we will bound later. With $\delta$ chosen this small it follows
from Lemma 7.9 that $t^*$ and s cannot be $\delta$-separated. (Otherwise $(s + t^*)/2$ would give a solution
improving the optimum.) Here, we used the fact that $\sum_{j \in I} s_j \le Bn$ for every $I \subseteq [m]$, $|I| = n$ and
therefore

\[ e^{\varphi_A(s)} \le e^{\log(e^{Bn}\det(AA^*))} = e^{Bn}\det(AA^*). \]

Similarly we get the same bound for $\varphi_A(t^*)$. Hence, we conclude that s must be $\delta$-close to an
optimal solution in each coordinate. This implies (using standard perturbation bounds for the
inverse of a matrix) that the optimality conditions from Lemma 6.1 are approximately satisfied
for s in the sense that

\[ e^{s_j} = \frac{c_j (1 + \varepsilon_j)}{\langle u_j, (A e^S A^*)^{-1} u_j \rangle} \]

with $\varepsilon_j \in [-\varepsilon', \varepsilon']$. Consider the positive definite matrix $M = \sum_j e^{s_j} u_j \otimes u_j$. Its inverse square root
$R = M^{-1/2}$ satisfies

\[ \mathrm{Id}_n = \sum_{j=1}^{m} e^{s_j} Ru_j \otimes Ru_j = \sum_{j=1}^{m} c_j \frac{Ru_j \otimes Ru_j}{\|Ru_j\|^2} + \sum_{j=1}^{m} \varepsilon_j c_j \frac{Ru_j \otimes Ru_j}{\|Ru_j\|^2}. \]

Let $J = -\sum_{j=1}^{m} \varepsilon_j c_j \frac{Ru_j \otimes Ru_j}{\|Ru_j\|^2}$ denote the error term above, moved to the side of the identity. It is not hard to show that for
$\varepsilon' = \varepsilon/\exp(\mathrm{poly}(L))$, we have that $\|J\|_\infty \le \varepsilon$. Since the dependence on $1/\delta$ in the Ellipsoid method is
logarithmic, the running time remains polynomial in L, $1/\alpha$ and $\log(1/\varepsilon)$. ■



Concluding Remarks 

Here we gave a polynomial time deterministic estimator for d-dimensional linear regression
in $\mathbb{R}^n$ that has a breakdown point of $1 - \frac{d}{n}$ and gave evidence that there is no polynomial time
estimator that has a better breakdown point. We explored this question based on connections to 
small set expansion, matroid theory and functional analysis. The most important open question 
is to understand the tradeoffs between efficiency and robustness for other inference problems, 
for example for the more challenging problem of estimating the covariance (see [24]). Computational
complexity is the obstacle to using the known robust statistical methods. Conversely it is
possible that the obstacle to using known estimators with provable computational guarantees is 
their sensitivity to violations in the model. 

A The Defect Lemma



Here we prove Lemma 7.9: 

Proof: As we did in the proof of Lemma 7.4, we will apply the Cauchy-Schwarz inequality to
the vectors u, v indexed by $I \subseteq [m]$, $|I| = n$ and defined as

\[ u_I = \sqrt{s_I d_I}, \qquad v_I = \sqrt{t_I d_I}. \]

We'd like to determine how much slack we have in this inequality. Let us therefore lower bound

\[ 1 - \Big( \frac{\langle u, v \rangle}{\|u\|\|v\|} \Big)^2 = \frac{\|u\|^2\|v\|^2 - \langle u, v \rangle^2}{\|u\|^2\|v\|^2} = \frac{\frac{1}{2}\sum_{I \ne J} (u_I v_J - u_J v_I)^2}{\|u\|^2\|v\|^2} \]

where the last step is Lagrange's identity. Now write $t_j = s_j + 2a_j$. By the assumption that s, t are
b-separated we must have that $\max_{i,j \in [m]} |a_i - a_j| > b$. Let i, j be a pair of indices achieving
the maximum. Without loss of generality assume that $a_j > a_i + b$. Let $S \subseteq [m]$ be a set of size
$|S| = n - 1$ such that $I = S \cup \{i\}$ and $J = S \cup \{j\}$ satisfy $d_I \ne 0$ and $d_J \ne 0$. Such a set S must exist by
our assumption. Then:



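\[ (u_I v_J - u_J v_I)^2 = (e^{a_j} - e^{a_i})^2 \, e^{s_i + s_j + \sum_{k \in S}(s_k + t_k)} \, d_I d_J \ge (e^{a_j} - e^{a_i})^2 \, d_I d_J \]

where the inequality follows because $s_k, t_k \ge 0$ for all k. On the other hand, $(e^{a_j} - e^{a_i})^2 \ge (e^b - 1)^2 e^{2a_i}$.
But $e^x - 1 \ge x$ and $2a_i = t_i - s_i \ge -\|s\|_\infty$. Thus:

\[ (u_I v_J - u_J v_I)^2 \ge \gamma \quad \text{with} \quad \gamma = b^2 \min_{I : d_I \ne 0} d_I^2 \cdot e^{-\|s\|_\infty}. \]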

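Therefore:

\[ \Big( \frac{\langle u, v \rangle}{\|u\|\|v\|} \Big)^2 \le 1 - \frac{\gamma}{\|u\|^2\|v\|^2}. \qquad (3) \]

On the other hand

\[ \|u\|^2 = \det(A e^S A^*) \le e^{n\|s\|_\infty} \det(AA^*), \qquad \|v\|^2 = \det(A e^T A^*) \le e^{n\|t\|_\infty} \det(AA^*). \]

Taking logarithms on both sides of (3), we get

\[ \log\Big( \frac{\langle u, v \rangle}{\|u\|\|v\|} \Big) \le -\frac{\gamma}{2} \cdot \frac{e^{-n(\|s\|_\infty + \|t\|_\infty)}}{\det(AA^*)^2} \]

where we used that $\log(1 - x) \le -x$ for all $1 > x \ge 0$. Since $\log\langle u, v \rangle = \varphi_A(\frac{s+t}{2})$ and
$\log(\|u\|\|v\|) = \frac{\varphi_A(s) + \varphi_A(t)}{2}$ by Claim 7.3, this gives the defect bound of Lemma 7.9. ■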